4 Results - Differential Expression
To compare lexicons for male and female characters across all authors, I used DESeq2 to test for differences in word frequency by character gender. This process is computationally intensive, so I included only the 10,000 most frequent words and the 3,000 characters with the highest read depth (the entire dataset includes 25,000 characters across 6,000 books). Results are visualized in an interactive plot (fig 4.1). Differentially expressed words for female (table 6.1) and male (table 6.2) characters are included in the supplemental data.
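The filtering step described above can be sketched as follows. This is a minimal illustration, not the code used for the analysis. It assumes that the counts are stored as a word-by-character matrix, that "read depth" means the total number of words attributed to a character, and that depth is computed before the word filter is applied. The function name `filter_counts` and the toy matrix are hypothetical.

```python
def filter_counts(counts, n_words, n_characters):
    """Keep the most frequent words (rows) and the deepest characters (columns).

    counts[w][c] is the number of times word w is associated with character c.
    Returns the filtered matrix plus the retained row and column indices.
    """
    # Total occurrences of each word across all characters.
    word_totals = [sum(row) for row in counts]
    # Assumed definition of read depth: total word count per character.
    depth = [sum(col) for col in zip(*counts)]

    # Rank by descending total, keep the top n, then restore original order.
    top_words = sorted(
        sorted(range(len(word_totals)), key=lambda w: -word_totals[w])[:n_words]
    )
    top_chars = sorted(
        sorted(range(len(depth)), key=lambda c: -depth[c])[:n_characters]
    )

    filtered = [[counts[w][c] for c in top_chars] for w in top_words]
    return filtered, top_words, top_chars


# Toy example: 4 words x 4 characters, keeping 2 of each.
toy = [
    [5, 0, 1, 9],
    [0, 1, 0, 2],
    [3, 3, 0, 7],
    [1, 0, 0, 0],
]
filtered, kept_words, kept_chars = filter_counts(toy, n_words=2, n_characters=2)
```

For the analysis described here, the equivalent call would use `n_words=10000` and `n_characters=3000`.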
Figure 4.1: Differential word frequency by character gender
Next, I compared lexicons for male and female authors (fig 4.2). A subjective assessment of the data suggests that these differences are largely due to genre. Differentially expressed words are included in the supplemental data (tables 6.3 and 6.4).
Figure 4.2: Differential word frequency by author gender
Words that are more frequently associated with male characters are also more frequently used by male authors (fig 4.3). This is likely a consequence of male authors more frequently writing male characters (fig (fig:characterCounts)). I haven’t yet figured out the best way to control for this difference. One option I might try is weighting word frequency by character gender frequency.
Figure 4.3: Correlation between word frequency across character and author gender
Next, I trained an ensemble learner to classify characters (table 4.2) and authors (table 4.1) by gender using DESeq2-normalized word counts. The model classified author gender with disturbing accuracy. Interpreting this result without examining confounding variables such as genre would be a mistake, but I’m not sure how best to do this. Goodreads.com tags are probably rather biased, and not an impartial estimator of genre. I might try classifying books by genre using their text, and then independently classifying authors by gender within each genre.
An alternative method might be to adapt another technique from single-cell RNA-sequencing. I might be able to position books on a manifold in genre-space, and examine how author genders are distributed along the manifold.
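For readers unfamiliar with the "DESeq2 normalized word counts" used as classifier input above, the sketch below illustrates the median-of-ratios idea behind DESeq2's default size factors, translated to a word-by-character matrix. It is a simplified illustration in plain Python, not DESeq2 itself and not the code used for the analysis. The function names `size_factors` and `normalize` and the toy counts are hypothetical.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors for a word-by-character count matrix.

    counts[w][c] is the count of word w for character c. Only words with a
    nonzero count for every character contribute to the estimate.
    """
    usable = [row for row in counts if all(x > 0 for x in row)]
    if not usable:
        raise ValueError("no word has a nonzero count for every character")

    # Log of each usable word's geometric mean across characters.
    log_geo_means = [sum(math.log(x) for x in row) / len(row) for row in usable]

    factors = []
    for c in range(len(counts[0])):
        # Log ratio of this character's count to the word's geometric mean.
        log_ratios = [
            math.log(row[c]) - lgm for row, lgm in zip(usable, log_geo_means)
        ]
        factors.append(math.exp(median(log_ratios)))
    return factors


def normalize(counts):
    """Divide each character's counts by that character's size factor."""
    sf = size_factors(counts)
    return [[x / s for x, s in zip(row, sf)] for row in counts]


# Toy example: the second character has twice the depth of the first.
toy = [
    [10, 20],
    [4, 8],
    [0, 3],   # skipped when estimating size factors (contains a zero)
    [6, 12],
]
sf = size_factors(toy)
normalized = normalize(toy)
```

In the toy example the second size factor comes out twice the first, so the normalized counts for the fully observed words are equal across the two characters. With sparse word counts, few words may occur for every character, so a real analysis would likely need an estimator that tolerates zeros.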
| | Predicted Male | Predicted Female |
|---|---|---|
| True Male | 1522 | 298 |
| True Female | 292 | 1736 |

| | Predicted Male | Predicted Female |
|---|---|---|
| True Male | 1907 | 359 |
| True Female | 677 | 879 |
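To make the two confusion tables above easier to compare, the following sketch computes overall accuracy and per-class recall directly from the four cell counts. The numbers are copied from the tables. The helper `summarize` and the labels `first` and `second` are hypothetical names that refer only to the order in which the tables appear.

```python
def summarize(tm_pm, tm_pf, tf_pm, tf_pf):
    """Summarize a 2x2 confusion table.

    Arguments are the cells in reading order: true male predicted male,
    true male predicted female, true female predicted male,
    true female predicted female.
    """
    total = tm_pm + tm_pf + tf_pm + tf_pf
    return {
        "n": total,
        # Fraction of all cases on the diagonal.
        "accuracy": (tm_pm + tf_pf) / total,
        # Fraction of truly male cases predicted male.
        "recall_male": tm_pm / (tm_pm + tm_pf),
        # Fraction of truly female cases predicted female.
        "recall_female": tf_pf / (tf_pm + tf_pf),
    }


first = summarize(1522, 298, 292, 1736)
second = summarize(1907, 359, 677, 879)

for name, stats in (("first", first), ("second", second)):
    print(name, {k: round(v, 3) for k, v in stats.items()})
```

Overall accuracy is about 0.85 for the first table and about 0.73 for the second. In the second table the errors are asymmetric: recall is about 0.84 for the male class but only about 0.56 for the female class, which accuracy alone hides.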